Abstract

Word embedding, which refers to low-dimensional dense vector representations of natural words, has demonstrated its power in many natural language processing tasks. However, it may suffer from the inaccurate and incomplete information contained in the free text corpus used as training data. To tackle this challenge, quite a few works leverage knowledge graphs as an additional information source to improve the quality of word embedding. Although these works have achieved certain success, they have neglected some important facts about knowledge graphs: (i) many relationships in knowledge graphs are many-to-one, one-to-many, or even many-to-many, rather than simply one-to-one; (ii) most head entities and tail entities in knowledge graphs come from very different semantic spaces. To address these issues, in this paper we propose a new algorithm named ProjectNet. ProjectNet models the relationships between head and tail entities after transforming them with different low-rank projection matrices. The low-rank projection allows non one-to-one relationships between entities, while using different projection matrices for head and tail entities allows them to originate in different semantic spaces. The experimental results demonstrate that ProjectNet yields more accurate word embedding than previous works, and thus leads to clear improvements in various natural language processing tasks.

1 Introduction 

In recent years, research on word embedding (or distributed word representations) has made promising progress in many natural language processing tasks [15][16][18]. Different from traditional one-hot discrete representations of words, word embedding vectors are dense, continuous, and low-dimensional. They are usually trained with neural networks on a large-scale free text corpus, such as Wikipedia, news articles, and web pages, in an unsupervised manner.

While word embedding has demonstrated its power in many circumstances, it is gradually recognized that conventional word embedding techniques may suffer from the incomplete and inaccurate information contained in the free text data. On one hand, due to the restricted topics and coverage of a text corpus, some words might not have sufficient contexts and therefore might not have reliable word embeddings. On the other hand, even if a word has sufficient contextual data, the free text might be inaccurate and thus might not provide a semantically precise view of the word. As a result, the learned word embedding might be unable to carry the desired semantic information. To tackle this problem, some researchers have recently proposed to leverage knowledge graphs, such as WordNet and Freebase, as additional data sources to improve word embedding [2][21][23].

A knowledge graph contains a set of nodes representing entities and a set of edges corresponding to the relationships between entities. In other words, a knowledge graph can be regarded as a set of triples $(h, r, t)$, where head entity $h$ and tail entity $t$ share the relationship $r$. In [23][21], in addition to the original likelihood loss on the free text, an extra loss function is imposed to capture the relationships in the knowledge graph. Specifically, the additional loss takes the form $L_{\mathcal{K}} = \sum_{(h,r,t)} \|\mathbf{h} + \mathbf{r} - \mathbf{t}\|_2^2$, where $\mathbf{h}$, $\mathbf{t}$ are the embedding vectors of the words (entities) $h$ and $t$ respectively, and $\mathbf{r}$ is the embedding of the relationship $r$.^1 Then the embeddings are learned by minimizing the overall loss on both free text and knowledge graph.

While the above approaches have shown certain success, we would like to point out their limitations.

First, the loss function $L_{\mathcal{K}}$ in these works cannot capture complex relationships between entities. In particular, it encounters problems when the relationships are one-to-many, many-to-one, or many-to-many. For example, $r$ = “cause of death” is a many-to-one relationship, since many different head entities $h_i$ (e.g., $h_1$ = “Abraham Lincoln” and $h_2$ = “John F Kennedy”) correspond to the same tail entity (e.g., $t$ = “assassination by firearm”). In this case, the minimization of $L_{\mathcal{K}}$ will force the embedding vectors of all head entities (e.g., $\mathbf{h}_1$, $\mathbf{h}_2$) to approach each other, which is clearly unreasonable.

Actually, such complex relationships are very common in knowledge graphs. Take FB13 [19], a widely used benchmark dataset that is a subset of Freebase, as an instance. For every relationship in FB13, we calculate the average number of head entities corresponding to one tail entity and the average number of tail entities corresponding to one head entity. Then we obtain the means and standard deviations of these values across relationships. The overall statistics are listed in Table 1, from which we can see that relationships in FB13 are highly non one-to-one, especially for the mapping from tail entity to head entity, as shown by the large mean value of #Head per Tail. In addition, the standard deviation of #Head per Tail is fairly large, indicating that the degree of non one-to-one mapping from tail entity to head entity varies drastically across relationships. This shows that the issue is serious and must be tackled in order to learn a reasonable word embedding.

^1 Please note that throughout the paper we use bold characters to represent the embedding vectors of the corresponding items.


| #Tail per Head (Mean) | #Tail per Head (Std. Deviation) | #Head per Tail (Mean) | #Head per Tail (Std. Deviation) |
|---|---|---|---|
| 1.26 | 0.23 | 2614.17 | 9229.75 |

Table 1: Number of head entities per tail entity and number of tail entities per head entity in FB13.
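The statistics described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `mapping_statistics`, the toy triples, and the use of the population standard deviation are assumptions, since the paper does not state which estimator was used for Table 1.

```python
from collections import defaultdict
from statistics import mean, pstdev

def mapping_statistics(triples):
    """Return ((mean, std) of #Tail per Head, (mean, std) of #Head per Tail).

    For each relationship we first average over its heads (or tails), then
    take the mean and standard deviation of those averages across relationships.
    """
    tails_of = defaultdict(lambda: defaultdict(set))  # relation -> head -> {tails}
    heads_of = defaultdict(lambda: defaultdict(set))  # relation -> tail -> {heads}
    for h, r, t in triples:
        tails_of[r][h].add(t)
        heads_of[r][t].add(h)

    # One average per relationship.
    tph = [mean(len(ts) for ts in by_head.values()) for by_head in tails_of.values()]
    hpt = [mean(len(hs) for hs in by_tail.values()) for by_tail in heads_of.values()]
    # pstdev (population std) is an assumption made for this sketch.
    return (mean(tph), pstdev(tph)), (mean(hpt), pstdev(hpt))

# Hypothetical toy triples, not taken from FB13.
toy_triples = [
    ("lincoln", "cause_of_death", "firearm"),
    ("kennedy", "cause_of_death", "firearm"),
    ("lincoln", "profession", "politician"),
    ("lincoln", "profession", "lawyer"),
]
tail_per_head, head_per_tail = mapping_statistics(toy_triples)
```

In the toy data, "cause_of_death" is many-to-one and "profession" is one-to-many, so both directions show averages above 1 for one of the two relationships.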

Second, the loss function $L_{\mathcal{K}}$ applies simple arithmetic operations to the embedding vectors of the head and tail entities, implying that both entities are located in the same space. However, head entities are usually more concrete and tail entities more abstract, making it unreasonable to simply regard them as lying in a homogeneous space. Taking the above example again, for the relationship $r$ = “cause of death”, all the head entities are names of real people whereas all the tail entities are abstract causes of death. Moreover, according to Table 1, head and tail entities are not symmetric from a statistical perspective: the number of tail entities per head entity is much smaller than the number of head entities per tail entity, which further indicates the heterogeneous nature of head and tail entities and suggests that we should treat them separately in the mathematical modeling.

In the literature, some works try to resolve one of the aforementioned issues; however, as far as we know, none of them successfully addresses both. For example, in [22], it is proposed to project the embedding vectors of both entities onto a relation-dependent hyperplane before computing the loss function $L_{\mathcal{K}}$. However, the heterogeneity between head and tail entities is not considered. Furthermore, the projection matrix used in [22] has a fixed rank for all types of relationships, and therefore cannot express varying degrees of non one-to-one mapping. In [5], different transformations are applied to head and tail entities respectively; however, nothing is done to address the issue of non one-to-one mappings.

To address the limitations of existing works, in this paper we propose a new algorithm called ProjectNet, which applies different and carefully designed projections to the head and tail entities respectively when defining the loss function $L_{\mathcal{K}}$. First, we show that a necessary condition for resolving the issue of non one-to-one mapping is that the projection matrix be low-rank. In this way, we can guarantee that the translation distance between the entities is small after projection without forcing their embedding vectors to be the same. Actually, it can be proven that the TransH model in [22] is a special case of our model, in the sense that it also adopts a projection matrix of low (and fixed) rank. Our model is more general since we can explicitly control the rank of the projection matrix, so as to adapt to knowledge graphs with different degrees of non one-to-one mapping. Second, by using different projection matrices for head and tail entities respectively, we avoid the homogeneity assumption on the semantic space and therefore build a more flexible and accurate model. For example, for the knowledge graph FB13, we should adopt a low-rank projection matrix for head entities since the number of head entities is very large for each tail entity; however, it is safe to use a projection matrix of relatively high rank for tail entities since the number of tail entities is rather small for each head entity.

We have tested the performance of our proposed algorithm on several benchmark datasets, and the experimental results show that our proposal significantly outperforms the baseline methods. This indicates the benefit of carefully modeling entities and relationships when incorporating knowledge graphs into the learning process of word embedding.

The rest of the paper is organized as follows. In Section 2, we summarize related works that leverage knowledge graphs to help word embedding. Then in Section 3 the detailed model is introduced and its differences from related methods are illustrated. After that, the experimental settings and results are reported in Section 4. The paper is finally concluded in Section 5.

2 Related Work 

Word embeddings (a.k.a. distributed word representations) are usually trained with neural networks by maximizing the likelihood of a text corpus. Building on several pioneering efforts, the research in this field has grown rapidly in recent years. Among these works, word2vec [15][16] has drawn considerable attention from the community due to its simplicity and effectiveness. An interesting result given by word2vec is that the word embedding vectors it produces can reflect human knowledge via simple arithmetic operations, e.g., $v(\text{Japan}) - v(\text{Tokyo}) \approx v(\text{France}) - v(\text{Paris})$.

However, as mentioned above, word embedding models like word2vec usually suffer from the incompleteness and inaccuracy of the free-text training corpus. To address this challenge, there have been attempts to leverage additional structured or unstructured human knowledge to enhance word embeddings. Here are some examples. In [14], the authors adopted morphological knowledge to aid the learning of rare words and new words. In [24], the authors used semantic relational knowledge between words as a constraint in learning word embedding vectors. In [23], the authors leveraged knowledge graphs, the most widely used form of structured knowledge, to help improve word representations. In particular, the authors not only minimized the loss on the text corpus, but also minimized the loss on the knowledge graph by sharing embedding vectors between words and entities. In [21], the authors proposed a very similar method to [23], but with a different objective: improving knowledge graph understanding with the help of a text corpus. Actually, the models in both [23] and [21] are inspired by the TransE model, a state-of-the-art approach for computing distributed representations of knowledge graphs [5]. In TransE, the relational operation between entities $h$, $t$ with relationship $r$ is assumed to be a simple linear translation, i.e., $\min \|\mathbf{h} + \mathbf{r} - \mathbf{t}\|_2^2$. However, as pointed out in the introduction, such a simple formulation cannot handle non one-to-one mappings between entities. To tackle the problem, in [22], the authors proposed a simple projection method named TransH. We will review the detailed mathematical forms of these models in Section 3.3 and discuss their relationship with our proposal.

3 The ProjectNet Algorithm 

In this section, we introduce our proposed ProjectNet model in detail. In general, following [23][21], given a training text corpus $\mathcal{D}$ and a set $\mathcal{K}$ of triples of the form (head entity, relation, tail entity) extracted from a knowledge graph, our model jointly minimizes a linear combination of the losses on both text and knowledge:

$$L = \alpha L_{\mathcal{D}} + (1 - \alpha) L_{\mathcal{K}}, \qquad (1)$$

where $\alpha \in [0, 1]$ trades off the two loss terms. $L_{\mathcal{D}}$ and $L_{\mathcal{K}}$ share the same parameters, i.e., the embedding vectors for words and their corresponding entities are the same. In the following subsections, we introduce the text model that specifies $L_{\mathcal{D}}$ and the knowledge model that specifies $L_{\mathcal{K}}$.

3.1 Text Model 

Similar to [23][21], we leverage the Skip-Gram model as the text model. In Skip-Gram, the probability of observing the target word $w_O$ given its context word $w_I$ is modeled as

$$P(w_O \mid w_I) = \frac{\exp(\mathbf{w}_O'^{\top} \mathbf{w}_I)}{\sum_{w \in V} \exp(\mathbf{w}'^{\top} \mathbf{w}_I)},$$

where $\mathbf{w} \in \mathbb{R}^d$ and $\mathbf{w}' \in \mathbb{R}^d$ denote the input and output embedding vectors of word $w$ respectively, $V$ is the dictionary, and $d$ is the dimension of the embedding.

Given the training corpus $\mathcal{D}$ consisting of $|\mathcal{D}|$ word tokens $\{p_1, \dots, p_k, \dots, p_{|\mathcal{D}|}\}$, the loss $L_{\mathcal{D}}$ is specified by:

$$L_{\mathcal{D}} = -\sum_{k=1}^{|\mathcal{D}|} \sum_{j \in \{-M, \dots, M\},\, j \neq 0} \log P(p_k \mid p_{k+j}), \qquad (2)$$

where $2M$ is the size of the sliding window. As it is expensive to directly minimize $L_{\mathcal{D}}$ due to the denominator of $P(w_O \mid w_I)$, we adopt the negative sampling strategy [16] to improve computational efficiency.
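To make the cost argument concrete, the following sketch contrasts the full-softmax probability, whose denominator runs over the whole dictionary, with a negative-sampling surrogate that touches only a handful of output vectors. The function names, the toy vectors, and the way negatives are passed in are illustrative assumptions; the sketch does not reproduce the sampling distribution used in word2vec.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax_prob(out_vecs, target, context_vec):
    """Full-softmax P(w_O | w_I): the denominator sums over every word in the dictionary."""
    scores = [math.exp(dot(w_out, context_vec)) for w_out in out_vecs]
    return scores[target] / sum(scores)

def negative_sampling_loss(out_vecs, target, context_vec, negatives):
    """Negative-sampling surrogate: one positive term plus one term per sampled negative."""
    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))
    loss = -math.log(sigmoid(dot(out_vecs[target], context_vec)))
    for n in negatives:
        loss -= math.log(sigmoid(-dot(out_vecs[n], context_vec)))
    return loss

# Hypothetical 3-word dictionary with 2-dimensional output vectors.
out_vecs = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
p = softmax_prob(out_vecs, target=0, context_vec=[0.0, 0.0])
ns = negative_sampling_loss(out_vecs, target=0, context_vec=[0.0, 0.0], negatives=[1, 2])
```

The surrogate costs one dot product per sampled negative instead of one per dictionary word, which is why it scales to dictionaries with hundreds of thousands of entries.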

3.2 Knowledge Model 

The knowledge model in ProjectNet is based on an asymmetric low-rank projection that projects the original entity embedding vectors into a new semantic space. The projection is designed to be asymmetric in order to handle the heterogeneity between head and tail entities, and to be low-rank in order to deal with non one-to-one relationships in the knowledge graph.


Asymmetric Projection 

As mentioned above, the head and tail entities in knowledge graphs are usually very different, from both semantic and statistical perspectives. Therefore, we argue that it is unreasonable to apply the same projection to these two kinds of entities (as TransH [22] does). Instead, it is better to apply different projection matrices, denoted as $L_r \in \mathbb{R}^{d \times d}$ and $R_r \in \mathbb{R}^{d \times d}$ respectively, to the head and tail entities. Hence, given a triple $(h, r, t)$, the original embedding vectors of $h$ and $t$ are transformed to $\mathbf{h}'$ and $\mathbf{t}'$ as follows:

$$\mathbf{h}' = L_r \mathbf{h}, \qquad \mathbf{t}' = R_r \mathbf{t}. \qquad (3)$$


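Based on the transformed embeddings, we define a scoring function $f_d$ to reflect the confidence level that the triple $(h, r, t)$ is true (a lower score indicates higher confidence):

$$f_d(h, r, t) = \|\mathbf{h}' + \mathbf{r} - \mathbf{t}'\|_2^2 = \|L_r \mathbf{h} + \mathbf{r} - R_r \mathbf{t}\|_2^2. \qquad (4)$$

Then we adopt a margin-based ranking loss to distinguish the golden relationship triples from randomly corrupted triples:

$$L_{\mathcal{K}} = \sum_{(h,r,t) \in \mathcal{K}} \; \sum_{(h',r',t') \in N(h,r,t)} \left[\gamma - f_d(h', r', t') + f_d(h, r, t)\right]_+, \qquad (5)$$

where $[x]_+ = \max(0, x)$, $\gamma > 0$ is the margin value, $N(h, r, t)$ is the set of all the corrupted triples built for the triple $(h, r, t)$ (here the primes mark corrupted triples, not projected vectors), and $L_r$ and $R_r$ will be specified in (8).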
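A minimal NumPy sketch of the scoring function and of the hinge terms of the ranking loss for a single golden triple is given below. The function names `score` and `margin_ranking_loss` and the toy values are assumptions made for illustration; how corrupted triples are generated is not shown.

```python
import numpy as np

def score(L_r, R_r, r, h, t):
    """f_d(h, r, t) = || L_r h + r - R_r t ||_2^2; a lower value means a more plausible triple."""
    diff = L_r @ h + r - R_r @ t
    return float(diff @ diff)

def margin_ranking_loss(pos_score, neg_scores, gamma=1.0):
    """Sum of hinge terms [gamma - f_d(corrupted) + f_d(golden)]_+ for one golden triple."""
    return sum(max(0.0, gamma - neg + pos_score) for neg in neg_scores)

# Hypothetical 2-dimensional example with identity projections.
I2 = np.eye(2)
h = np.array([1.0, 0.0])
r = np.array([0.0, 1.0])
t = np.array([1.0, 1.0])
pos = score(I2, I2, r, h, t)                    # h + r equals t, so the score is 0
loss = margin_ranking_loss(pos, [0.5, 3.0])     # only the first corrupted triple violates the margin
```

A corrupted triple contributes to the loss only while its score is within the margin of the golden triple's score, so well-separated negatives stop producing gradients.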

Low-Rank Projection 

As mentioned in the introduction, many relationships in knowledge graphs are non one-to-one. In this case, in order to achieve reasonable results when minimizing $L_{\mathcal{K}}$ defined above, it is necessary to constrain the projection matrices $L_r$ and $R_r$ to be low-rank, as stated in the following proposition.

Proposition 3.1 Once linear projections are imposed on head and tail entities, a necessary condition for overcoming the non one-to-one mapping problem is that the projection matrices $L_r$ and $R_r$ are not full-rank.

Proof Consider the following least-squares problem w.r.t. the optimization variable $\mathbf{h}$:

$$\min_{\mathbf{h}} \|L_r \mathbf{h} - \mathbf{c}\|_2^2, \qquad (6)$$

where $L_r \mathbf{h} = \mathbf{h}'$ and we regard $\mathbf{c} = \mathbf{t}' - \mathbf{r}$ as a constant vector. It is easy to see that the optimal solution $\mathbf{h}^*$ satisfies the following linear system:

$$L_r^{\top} L_r \mathbf{h}^* = L_r^{\top} \mathbf{c}. \qquad (7)$$

To avoid the non one-to-one mapping problem, the above equation must have multiple solutions, so that different head entities can attain the minimum without sharing the same embedding. This requires $L_r^{\top} L_r$ to be rank-deficient. In addition, as $\mathrm{rank}(L_r^{\top} L_r) = \mathrm{rank}(L_r)$, the linear projection matrix $L_r$ must not be full-rank either. The same conclusion holds for the projection matrix $R_r$ for the tail entity. ∎
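The argument of the proof can be checked numerically on a tiny example. In the sketch below, the 2 x 2 matrices and vectors are hypothetical values chosen for illustration: with a rank-deficient projection, two different head vectors both solve the least-squares problem exactly, whereas with a full-rank projection only one vector does.

```python
import numpy as np

L_low = np.array([[1.0, 0.0],
                  [0.0, 0.0]])      # rank 1 < d = 2
L_full = np.eye(2)                  # full rank
c = np.array([2.0, 0.0])            # c = t' - r, treated as a constant

# Two head vectors that differ only along the null space of L_low.
h_a = np.array([2.0, -5.0])
h_b = np.array([2.0, 7.0])

def residual(L, h):
    """Squared residual || L h - c ||_2^2 of the least-squares problem."""
    return float(np.sum((L @ h - c) ** 2))

low_residuals = (residual(L_low, h_a), residual(L_low, h_b))     # both minimisers
full_residuals = (residual(L_full, h_a), residual(L_full, h_b))  # neither equals c
```

Under the low-rank projection both heads reach zero residual while remaining different vectors, which is exactly the freedom a many-to-one relationship needs.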

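Given the above proposition, we use the following construction to ensure that $L_r$ and $R_r$ are low-rank matrices (whose ranks are at most $m_L$ and $m_R$ respectively, with $m_L < d$ and $m_R < d$):

$$L_r = \sum_{i=1}^{m_L} \sigma_r^{i}\, \mathbf{p}_r^{i} {\mathbf{q}_r^{i}}^{\top}, \qquad R_r = \sum_{i=1}^{m_R} \varsigma_r^{i}\, \mathbf{o}_r^{i} {\mathbf{s}_r^{i}}^{\top}, \qquad (8)$$

where $\sigma_r^{i}$, $\varsigma_r^{i}$ are scalars, and $\mathbf{p}_r^{i}$, $\mathbf{q}_r^{i}$, $\mathbf{o}_r^{i}$, $\mathbf{s}_r^{i}$ are all $d$-dimensional real vectors, the outer products of which constitute $(m_L + m_R)$ rank-1 matrices $\mathbf{p}_r^{i} {\mathbf{q}_r^{i}}^{\top}$ and $\mathbf{o}_r^{i} {\mathbf{s}_r^{i}}^{\top}$. For simplicity, we set the rank of all the left matrices $L_r$ to be the same ($m_L$) and the rank of all the right matrices $R_r$ to be the same ($m_R$). Note that we could also specify different ranks for different relationships $r$; we leave the corresponding discussion to future work.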
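The construction in (8) can be sketched as a sum of weighted outer products. The helper name `low_rank_projection`, the random initial vectors, and the sizes below are assumptions for illustration only; the point is that a sum of $m$ rank-1 matrices can never exceed rank $m$.

```python
import numpy as np

def low_rank_projection(scales, left_vecs, right_vecs):
    """Return sum_i scales[i] * outer(left_vecs[i], right_vecs[i]) as a d x d matrix."""
    d = left_vecs.shape[1]
    M = np.zeros((d, d))
    for s, p, q in zip(scales, left_vecs, right_vecs):
        M += s * np.outer(p, q)
    return M

rng = np.random.default_rng(0)
d, m_L = 6, 2                                  # hypothetical sizes, with m_L < d
L_r = low_rank_projection(
    np.ones(m_L),                              # scalar weights
    rng.normal(size=(m_L, d)),                 # left vectors p_r^i
    rng.normal(size=(m_L, d)),                 # right vectors q_r^i
)
rank_L = np.linalg.matrix_rank(L_r)
```

Parameterising the matrix through its factors means the rank bound holds throughout training without any explicit projection or constraint step.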

3.3 Discussions 

In this section, we discuss the connections between our proposed ProjectNet algorithm and a few previous works, and show that they are special cases of ProjectNet.

RNet. RNet refers to the knowledge models proposed in [23] and [21]. In fact, the models in the two works try to minimize the same scoring function: $f_d(h, r, t) = \|\mathbf{h} + \mathbf{r} - \mathbf{t}\|_2^2$. Their only difference lies in how $f_d$ is minimized. In [23], a large-margin ranking loss is adopted for the minimization of $f_d(h, r, t)$, whereas in [21], an approximate softmax loss is used. It is clear that such a scoring function $f_d(h, r, t)$ can handle neither the non one-to-one relationships between entities nor the heterogeneity between head and tail entities. To state it more formally, let us consider the relationship triples $(h_i, r, t)$, $i \in \{1, \dots, N\}$, where all head entities $h_i$ have the same relationship $r$ with tail entity $t$. In the ideal case, if all $f_d(h_i, r, t)$ are fully minimized, we will have $\mathbf{h}_i = \mathbf{t} - \mathbf{r}$, $\forall i \in \{1, \dots, N\}$, which implies that $\mathbf{h}_1 = \mathbf{h}_2 = \dots = \mathbf{h}_N$. It means that the embedding vectors of all the head entities are the same, which is clearly unreasonable. We encounter similar issues for one-to-many relationships $\{(h, r, t_j)\}_j$ and many-to-many relationships $\{(h_i, r, t_j)\}_{i,j}$.

Note that RNet corresponds to $L_r = R_r = I_{d \times d}$ in (4), and since the identity matrix $I_{d \times d}$ can be written in the form of (8) (as the sum of $d$ rank-1 matrices, i.e., the boundary case $m_L = m_R = d$), RNet can be regarded as a special case of ProjectNet.

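TransH. TransH [22] is proposed to overcome the non one-to-one mapping problem. It first projects the entity embedding vectors $\mathbf{h}$ and $\mathbf{t}$ onto a hyperplane w.r.t. the relationship $r$, and then the projected vectors $\mathbf{h}_\perp$ and $\mathbf{t}_\perp$ are used to define the scoring function $f_d$. Specifically,

$$\mathbf{h}_\perp = \mathbf{h} - \mathbf{w}_r^{\top} \mathbf{h}\, \mathbf{w}_r, \qquad \mathbf{t}_\perp = \mathbf{t} - \mathbf{w}_r^{\top} \mathbf{t}\, \mathbf{w}_r,$$
$$f_d(h, r, t) = \|\mathbf{h}_\perp + \mathbf{r} - \mathbf{t}_\perp\|_2^2, \qquad (9)$$

where $\mathbf{w}_r \in \mathbb{R}^d$ is the unit-length normal vector of the hyperplane (i.e., $\mathbf{w}_r \cdot \mathbf{r} = 0$ and $\|\mathbf{w}_r\|_2 = 1$).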

Our proposed ProjectNet model differs from TransH in two ways: (i) we apply different projections to head and tail entities; (ii) we adopt general projection matrices rather than a hyperplane-based projection. Actually, TransH (9) can be regarded as a special case of ProjectNet (5), as shown below. Starting from (9), we have

$$\mathbf{h}_\perp = \mathbf{h} - \mathbf{w}_r^{\top} \mathbf{h}\, \mathbf{w}_r = \mathbf{h} - \mathbf{w}_r \mathbf{w}_r^{\top} \mathbf{h} = (I - \mathbf{w}_r \mathbf{w}_r^{\top}) \mathbf{h}. \qquad (10)$$

Hence, by substituting $L_r = I - \mathbf{w}_r \mathbf{w}_r^{\top}$ (and likewise $R_r$) in (4), we get TransH. We still need to check whether $L_r = I - \mathbf{w}_r \mathbf{w}_r^{\top}$ can be written in the form of (8), i.e., as the weighted sum of $m_L$ rank-1 matrices, where $m_L < d$. We answer this question in the following two steps. (i) As $L_r = I - \mathbf{w}_r \mathbf{w}_r^{\top}$ is an idempotent matrix (i.e., $L_r L_r = L_r$) and $\mathbf{w}_r$ is a unit-length vector, it holds that $\mathrm{rank}(L_r) = \mathrm{trace}(L_r) = d - 1$. Therefore, $L_r$ has $d - 1$ non-zero eigenvalues. Furthermore, by observing that the eigenvalues of $\mathbf{w}_r \mathbf{w}_r^{\top}$ are 0 and 1, we can conclude that $L_r$ has 1 as one of its eigenvalues, corresponding to $d - 1$ linearly independent eigenvectors, and 0 as its other eigenvalue, corresponding to one eigenvector. (ii) Further considering that $L_r$ is a real symmetric matrix, we can decompose it as $L_r = U_r \Sigma_r U_r^{\top}$, where $U_r = [\mathbf{u}_r^{1}, \dots, \mathbf{u}_r^{d}] \in \mathbb{R}^{d \times d}$ and $\Sigma_r = \mathrm{diag}(1, 1, \dots, 1, 0) \in \mathbb{R}^{d \times d}$. The first $d - 1$ columns of $U_r$, $\mathbf{u}_r^{1}, \dots, \mathbf{u}_r^{d-1}$, are mutually orthogonal unit-length eigenvectors of $L_r$ corresponding to eigenvalue 1, and $\Sigma_r$ stores all the eigenvalues of $L_r$. Thus we can write $L_r = \sum_{i=1}^{d-1} \mathbf{u}_r^{i} {\mathbf{u}_r^{i}}^{\top}$. The same procedure holds for the relation between $R_r \mathbf{t}$ and $\mathbf{t}_\perp$. Then according to the above discussion, we obtain the following proposition.

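Proposition 3.2 In the knowledge model of ProjectNet, letting $m_L = m_R = d - 1$, $\sigma_r^{i} = \varsigma_r^{i} = 1$, and $\mathbf{p}_r^{i} = \mathbf{q}_r^{i} = \mathbf{o}_r^{i} = \mathbf{s}_r^{i} = \mathbf{u}_r^{i}$, where $\mathbf{u}_r^{i}$ is the $i$-th unit-length eigenvector of the matrix $I - \mathbf{w}_r \mathbf{w}_r^{\top}$ with eigenvalue 1, $i = 1, 2, \dots, d - 1$, we obtain the TransH model [22].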
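The reduction of TransH to a sum of $d - 1$ rank-1 matrices can be verified numerically. The following sketch uses a randomly drawn unit normal vector and a small dimension, both chosen only for illustration, and relies on `numpy.linalg.eigh` for the eigendecomposition of the symmetric projection matrix.

```python
import numpy as np

d = 5                                        # hypothetical embedding size
rng = np.random.default_rng(0)
w_r = rng.normal(size=d)
w_r /= np.linalg.norm(w_r)                   # unit-length normal vector of the hyperplane

L_r = np.eye(d) - np.outer(w_r, w_r)         # TransH projection written as a matrix

# Symmetric matrix: eigh returns real eigenvalues with orthonormal eigenvectors.
eigvals, eigvecs = np.linalg.eigh(L_r)
U = eigvecs[:, eigvals > 0.5]                # keep the eigenvectors whose eigenvalue is 1
rebuilt = sum(np.outer(U[:, i], U[:, i]) for i in range(U.shape[1]))

h = rng.normal(size=d)
h_perp = h - (w_r @ h) * w_r                 # hyperplane projection as in TransH
```

Here `rebuilt` reproduces `L_r` from $d - 1$ outer products with unit weights, and `L_r @ h` coincides with `h_perp`, matching the setting of the proposition.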

SE. SE [5] adopts the following scoring function:

$$f_d(h, r, t) = \|L_r \mathbf{h} - R_r \mathbf{t}\|_1. \qquad (11)$$

SE looks very similar to our proposed knowledge model. However, there is a key difference: SE does not impose low-rank constraints on the matrices $L_r$ and $R_r$. In other words, it leaves these two matrices full-rank, while in our model the rank of the matrices is a variable. Therefore, our model is more general than SE and can handle non one-to-one relationships when the rank is low, while SE cannot since its matrices are always full-rank. In this sense, we can also regard SE as a special case of our proposed ProjectNet model.

TransR. TransR treats relationships and entities as different objects and thus separates their embeddings into different spaces:

$$f_d(h, r, t) = \|M_r \mathbf{h} + \mathbf{r} - M_r \mathbf{t}\|_2^2. \qquad (12)$$

Unlike our formulation (8), TransR does not add a low-rank constraint to the matrix $M_r$ (in other words, it sets $M_r$ to be full-rank). In addition, TransR applies the same transformation matrix to head and tail entities, assuming that they are located in the same space. Therefore, the knowledge model in our ProjectNet algorithm is more general than TransR, and includes it as a special case.

4 Experiments 

In this section, we conduct a set of experiments to verify the 
effectiveness of the ProjectNet model. 

4.1 Experiments Setup 

Training Data 

For the free text corpus, we used a public snapshot of English Wikipedia named enwik9.^2 The corpus contains about 120 million word tokens. We removed numeric words and words with frequency less than 5. Then we leveraged the knowledge graph FB13 [19] to impose relationships onto the entities covered by enwik9. Since FB13 contains many entities whose names have multiple words, in enwik9 we merged these words into phrases and regarded both single words and phrases as embedding units in the dictionary. The final dictionary size is about 230k.

^2 http://mattmahoney.net/dc/enwik9.zip

Baseline Methods 

We consider the following algorithms as our baselines (we used the code released by the authors of these works for implementation):

1. Skip-Gram (SG): the original Skip-Gram model in word2vec, which uses only the text loss $L_{\mathcal{D}}$ in (1).

2. RNet: the joint embedding model in [23] and [21], which adopts the objective $\min \|\mathbf{h} + \mathbf{r} - \mathbf{t}\|_2^2$ in the knowledge model.^3

3. Skip-Gram+TransH (SG+TransH): the combination of Skip-Gram (for the text model) and TransH (for the knowledge model). According to the discussion in Section 3.3, this baseline is a special case of ProjectNet.

Parameter Setting 

In our experiments, we set the embedding size to $d = 100$. Stochastic Gradient Descent (SGD) is used to train all the models. We set the initial learning rate to 0.025 and decreased it linearly during the training process. For the knowledge model in ProjectNet, we initialized the projection matrices $L_r$ and $R_r$ as diagonal matrices with randomly assigned 0/1 elements (with $m_L$ and $m_R$ non-zero elements respectively). We varied $m_L$ and $m_R$ over the set $\{10, 20, \dots, 80, 90, 95, 100\}$. For all the joint embedding models, we varied the trade-off parameter $\alpha$ in (1) over the set $\{0.01, 0.05, 0.1, 0.2, 0.5\}$. The margin value is set to $\gamma = 1$.
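The initialization and learning-rate schedule described above might look as follows. This is a sketch under assumptions: the helper names are hypothetical, the diagonal is stored as a plain list, and the schedule is assumed to decay linearly to zero, whereas the text only says the rate is decreased linearly.

```python
import random

def init_projection_diagonal(d, m, rng):
    """Diagonal of a d x d 0/1 diagonal matrix with exactly m ones at random positions."""
    diag = [0.0] * d
    for i in rng.sample(range(d), m):
        diag[i] = 1.0
    return diag

def linear_lr(step, total_steps, lr0=0.025):
    """Learning rate decayed linearly from lr0; decaying all the way to 0 is an assumption."""
    return lr0 * max(0.0, 1.0 - step / total_steps)

rng = random.Random(0)
diag_L = init_projection_diagonal(d=100, m=50, rng=rng)   # head projection, m_L = 50
diag_R = init_projection_diagonal(d=100, m=90, rng=rng)   # tail projection, m_R = 90
lr_mid = linear_lr(step=50, total_steps=100)
```

A 0/1 diagonal with $m$ ones is a sum of $m$ rank-1 matrices built from standard basis vectors, so this initialization already has the form of a rank-$m$ projection.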

We used two tasks to evaluate our algorithm and the baseline models: the analogical reasoning task and the word similarity task. The corresponding experimental results are shown in the following two subsections.

4.2 Analogical Reasoning Task 

The analogical reasoning task is a word relationship inference task proposed in [16]. It consists of quadruple word questions $a : b, c : d$, in which the relationship between words $a$ and $b$ is the same as that between $c$ and $d$. For instance, $(a : b, c : d)$ = (Berlin : Germany, Paris : France), where the relationship $r$ is capital-country. The task aims to infer word $d$ given words $a$, $b$, and $c$ using their word embedding vectors. More concretely, the inferred word $\hat{d}$ is given by $\hat{d} = \arg\max_{w \in V} \mathrm{cosine}(\mathbf{b} - \mathbf{a} + \mathbf{c}, \mathbf{w})$. If $\hat{d} = d$, the answer to this quadruple word question is counted as right; otherwise, it is wrong.
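The inference rule can be sketched directly from the formula. The function name `answer_analogy` and the toy 3-dimensional embeddings are illustrative assumptions; the sketch searches the whole dictionary and does not exclude the question words $a$, $b$, $c$, since the text does not say whether they are excluded.

```python
import numpy as np

def answer_analogy(emb, a, b, c):
    """Return the word w maximising cosine(b - a + c, w) over the dictionary `emb`."""
    query = emb[b] - emb[a] + emb[c]

    def cosine(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

    return max(emb, key=lambda w: cosine(query, emb[w]))

# Hypothetical embeddings: the third coordinate encodes "is a country".
emb = {
    "Berlin":  np.array([1.0, 0.0, 0.0]),
    "Germany": np.array([1.0, 0.0, 1.0]),
    "Paris":   np.array([0.0, 1.0, 0.0]),
    "France":  np.array([0.0, 1.0, 1.0]),
}
predicted = answer_analogy(emb, "Berlin", "Germany", "Paris")
```

In this toy setting the query vector equals the embedding of "France", so the prediction is counted as right.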

^3 As mentioned above, the models in the two papers differ only in the loss function (i.e., ranking loss vs. approximate softmax loss). Hence, we unify these two models under the name RNet and report the better performance of the two loss functions.


To construct the test set for the analogical reasoning task, we randomly sampled 1% of the triples from FB13 and filtered them according to the dictionary of enwik9.^4 This test set consists of about 20k questions belonging to 7 non one-to-one relationships. The detailed statistics of this test set can be found in Table 2.

Then, we went through the remaining triples in FB13 and removed all the triples sharing entities with the test data. In this way, we obtained a training set with roughly 76k triples, which has no overlap with the test set in either relationship triples or entities. The goal of doing so is to examine whether the free text corpus can act as a bridge between known and unknown entities, so as to verify the necessity of jointly embedding text and knowledge into the distributed representation space.

For ProjectNet, as we imposed $L_r \mathbf{a} - R_r \mathbf{b} = -\mathbf{r} = L_r \mathbf{c} - R_r \mathbf{d}$ instead of $\mathbf{a} - \mathbf{b} = \mathbf{c} - \mathbf{d}$, we took a two-step approach instead of directly using the original word vectors to perform the analogical reasoning task: (i) we chose the relationship $r^*$ that best describes the relationship between $a$ and $b$, i.e., $r^* = \arg\min_r \|L_r \mathbf{a} + \mathbf{r} - R_r \mathbf{b}\|_2$; (ii) under $r^*$, we chose the answer word $\hat{d}$ according to $\hat{d} = \arg\min_{w \in V} \|L_{r^*} \mathbf{c} + \mathbf{r}^* - R_{r^*} \mathbf{w}\|_2$. The same evaluation method was applied to SG+TransH.
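The two-step procedure can be sketched as two nested searches. Everything named here is hypothetical: the function `projectnet_analogy`, the layout of `relations` as a dictionary of `(L_r, R_r, r)` tuples, and the toy parameters. The squared distance is used in place of the norm, which leaves both arg-min results unchanged.

```python
import numpy as np

def projectnet_analogy(emb, relations, a, b, c):
    """relations maps a name to (L_r, R_r, r). Returns (best relation, predicted word)."""
    def dist(L, R, r, head, tail):
        diff = L @ emb[head] + r - R @ emb[tail]
        return float(diff @ diff)

    # Step (i): pick the relationship that best explains the pair (a, b).
    r_star = min(relations, key=lambda k: dist(*relations[k], a, b))
    # Step (ii): under that relationship, pick the tail word closest to the projection of c.
    d_hat = min(emb, key=lambda w: dist(*relations[r_star], c, w))
    return r_star, d_hat

# Hypothetical toy model: a rank-1 head projection discards the second coordinate,
# so two different people can share the same tail entity.
emb = {
    "lincoln": np.array([1.0, 2.0]),
    "kennedy": np.array([1.0, -3.0]),
    "male":    np.array([4.0, 0.0]),
    "lawyer":  np.array([0.0, 5.0]),
}
L_head = np.diag([1.0, 0.0])
relations = {
    "gender":     (L_head, np.eye(2), np.array([3.0, 0.0])),
    "profession": (L_head, np.eye(2), np.array([-1.0, 5.0])),
}
r_star, d_hat = projectnet_analogy(emb, relations, "lincoln", "male", "kennedy")
```

Although the two head vectors differ, the low-rank head projection maps both to the same point, so the second question is answered with the same tail entity.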

The experimental results are listed in Table 2, from which we have the following observations:

• All the knowledge-based models (RNet, SG+TransH, and ProjectNet) outperform the original SG model in overall accuracy, indicating that the quality of word embedding can be improved by leveraging knowledge graphs.

• The two models that take non one-to-one mappings into consideration (i.e., SG+TransH and ProjectNet) are superior to RNet, showing the necessity of modeling non one-to-one mappings in the loss function.

• Among all the models, ProjectNet achieves the best performance in all seven subtasks. In overall accuracy, it achieves a relative gain of over 30% over SG+TransH (15.28% vs. 11.24%). This demonstrates the advantages of our proposed model.

| Relationship | #Question | SG | RNet | SG+TransH | ProjectNet |
|---|---|---|---|---|---|
| cause of death | 4290 | 3.29% | 4.31% | 7.55% | 10.84% |
| nationality | 870 | 14.60% | 14.14% | 14.82% | 17.47% |
| gender | 650 | 67.08% | 59.38% | 75.54% | 84.62% |
| profession | 6320 | 3.13% | 5.78% | 8.66% | 13.42% |
| institution | 4556 | 1.54% | 3.03% | 4.92% | 6.72% |
| ethnicity | 342 | 16.96% | 15.50% | 15.79% | 18.71% |
| religion | 3192 | 13.00% | 12.47% | 15.88% | 22.06% |
| Total | 20220 | 7.33% | 8.15% | 11.24% | 15.28% |

Table 2: Accuracy of different models on analogical reasoning task.

Sensitivity to different ranks

The best performance of ProjectNet was obtained with $\alpha = 0.2$, $m_L = 50$, and $m_R = 90$. To show the influence of the ranks of the projection matrices, in Figure 1 we plot two curves that show the performance of ProjectNet w.r.t. different rank values $m_L$ and $m_R$: one curve corresponds to changing $m_R$ while fixing $m_L = 50$, and the other corresponds to changing $m_L$ while fixing $m_R = 90$. From the figure, we have the following observations. (i) The performance degrades when the rank is too low. This is because the model's expressiveness becomes poor due to the small number of free parameters in the projection matrices. (ii) For the projection matrix of head entities, medium values of $m_L$ give the best performance (the dashdot line), while for the projection matrix of tail entities, higher values of $m_R$ lead to better performance (the solid line). This result is consistent with the statistics in Table 1: the degree of non one-to-one mapping for head entities is much higher than that for tail entities, suggesting a lower rank for the projection matrix of head entities.

^4 We did not use the analogical reasoning dataset given in [16] because this dataset is too special, in the sense that almost all the relationships in it are one-to-one mappings.



Figure 1: Accuracy w.r.t. different head and tail ranks. The dashdot 
line records the accuracy varying with different head ranks when tail 
rank is fixed to 90. The solid line records the accuracy varying with 
different tail ranks when head rank is fixed to 50. 

4.3 Word Similarity Task 

Word similarity is a task that investigates whether the similarity computed from word embedding vectors is consistent with human-labeled word similarity. We used three word similarity datasets in our experiments, namely Word Similarity 353 (WS353), SCWS, and Rare Word (RW) [14]. There are 353, 2003, and 2034 word pairs in these datasets respectively. From the word embedding vectors, we obtain a similarity score (e.g., cosine similarity) for each word pair, based on which a ranked list of the word pairs is derived. Then the generated ranked list is compared with the ranked list produced by the ground-truth similarity scores assigned by human labelers. To evaluate the consistency between the two ranked lists, we used Spearman's rank correlation (denoted as $\rho \in [-1, 1]$). A higher $\rho$ corresponds to better word embedding vectors.
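As a sketch of this evaluation protocol, Spearman's rank correlation can be computed as the Pearson correlation of the two rank vectors. The helper names and the toy scores below are assumptions; giving tied values the average of their ranks is a common convention that the text does not specify.

```python
def average_ranks(values):
    """1-based ranks; tied values share the average of the positions they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the rank vectors of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical cosine similarities from a model vs. human scores for three word pairs.
model_scores = [0.1, 0.5, 0.9]
human_scores = [1.0, 2.0, 3.0]
rho_same = spearman(model_scores, human_scores)
rho_reversed = spearman(model_scores, human_scores[::-1])
```

Only the ordering of the scores matters, so the model's similarity values do not need to be on the same scale as the human ratings.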

Table 3 summarizes the results. For ProjectNet and SG+TransH, the word embedding vectors were directly used to compute the similarity scores, which is different from the analogical reasoning task. This is because there is no explicit relationship available in the evaluation process. The best performances of ProjectNet on the three datasets were obtained with the parameter settings ($m_L = 50$, $m_R = 95$, $\alpha = 0.05$), ($m_L = 40$, $m_R = 90$, $\alpha = 0.01$), and ($m_L = 60$, $m_R = 95$, $\alpha = 0.05$) respectively. Table 3 shows that ProjectNet achieves the best performance on all the datasets, which further indicates that ProjectNet produces higher-quality word embedding vectors than the baseline methods.


| Task/Model | SG | RNet | SG+TransH | ProjectNet |
|---|---|---|---|---|
| WS353 | 0.647 | 0.661 | 0.666 | 0.684 |
| SCWS | 0.610 | 0.614 | 0.618 | 0.630 |
| RW | 0.179 | 0.184 | 0.187 | 0.198 |

Table 3: Spearman's rank correlation ($\rho$) on three word similarity datasets: WS353, SCWS, and RW. Each $\rho$ is reported as the average value of five repeated runs.


5 Conclusions and Future Work 

In this paper we proposed a novel word embedding algorithm called ProjectNet, which leverages knowledge graphs to improve the quality of word embedding. In ProjectNet, we apply different (asymmetric) low-rank projections to the head and tail entities in an entity-relationship triple, and thereby accommodate both the non one-to-one mappings and the heterogeneity of head and tail entities in knowledge graphs. Experimental results demonstrate that ProjectNet significantly outperforms previous embedding models.

For future work, we plan to apply the proposed approach to knowledge mining tasks, such as triplet classification and link prediction [22]. In addition, we plan to use the word embedding vectors generated by ProjectNet in real-world applications such as document classification and web search ranking.